I am using the dataset from the pokemontcg.io API service. They have a GitHub repository with all the data in JSON format that can be downloaded directly. After downloading and extracting the data, here is my project file structure:
All the card data is separated into one file per set. So let's first see how many cards we are working with by looping through all the JSON files in the data/json/cards/en/ directory and adding up the number of cards across all files.
import os
import json

all_cards_dataset = []
json_files = os.listdir('data/json/cards/en/')

# Each JSON file holds the list of cards for one set
for file in json_files:
    if file.endswith('.json'):
        with open(os.path.join('data/json/cards/en/', file), 'r') as f:
            data = json.load(f)
            all_cards_dataset.extend(data)

print(f"Total number of cards in dataset: {len(all_cards_dataset)}")
Total number of cards in dataset: 19653
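The loop above only reports the grand total. If a per-set breakdown is also useful, something along these lines could work. This is a minimal sketch: the helper name `count_cards_per_set` is made up for illustration, and it assumes every `.json` file in the directory contains a JSON list of cards, as the loop above does.

```python
import json
import os
from collections import Counter

CARDS_DIR = 'data/json/cards/en/'  # same directory as in the loop above


def count_cards_per_set(cards_dir):
    """Return a Counter mapping each set file name (without .json) to its card count."""
    counts = Counter()
    for file_name in sorted(os.listdir(cards_dir)):
        if file_name.endswith('.json'):
            with open(os.path.join(cards_dir, file_name), 'r', encoding='utf-8') as f:
                # Assumption: each file is a JSON list with one entry per card
                counts[file_name[:-len('.json')]] = len(json.load(f))
    return counts


# Only run against the real data if the directory is present
if os.path.isdir(CARDS_DIR):
    per_set = count_cards_per_set(CARDS_DIR)
    print(f"Number of sets: {len(per_set)}")
    for set_id, n_cards in per_set.most_common(5):
        print(f"{set_id}: {n_cards} cards")
```

Summing the values of the returned counter should give the same total as the loop above.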
For this analysis, 19653 cards is too many. I am going to keep only the cards for the first generation of Pokémon (the original 151). This should give me enough data to work with and lets the data span multiple expansions and sets while keeping the dataset small. In a future analysis I might expand to all Pokémon cards.
Let's first convert all the JSON files to a single CSV file containing all the cards, and then filter the data down to only the first generation Pokémon.
import pandas as pd

# Flatten the nested card records into a table and save every card to one CSV
df = pd.json_normalize(all_cards_dataset)
df.to_csv('data/all_pokemon_cards.csv', index=False)
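As a side note, `pd.json_normalize` and `pd.DataFrame` treat nested fields differently, which affects how many columns you end up with. The sketch below uses a made-up card record (the `images` field and its values are only illustrative) to show the difference:

```python
import pandas as pd

# Made-up record for illustration; real cards have many more fields
sample_cards = [
    {'id': 'x-1', 'name': 'Abra', 'images': {'small': 'a.png', 'large': 'b.png'}},
]

flat = pd.json_normalize(sample_cards)   # nested dict keys become dotted columns
plain = pd.DataFrame(sample_cards)       # nested dict stays in a single column

print(sorted(flat.columns))
print(sorted(plain.columns))
```

With `json_normalize`, the nested dictionary is spread over `images.small` and `images.large`, while `pd.DataFrame` keeps one `images` column holding the whole dictionary.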
To find all the cards related to the first generation Pokémon, I need a list of their names. I found a list on Wikipedia and copied it into a txt file. This gives me the names to filter the cards by. Here is the list below:
We can then use this list to filter the full set of cards down to the first generation Pokémon. Since a card name can contain extra words, such as “Cool Porygon”, we will use the str.contains method to keep every card whose name contains any of the first generation names. The names are joined with | so that str.contains, which treats its argument as a regular expression by default, matches any one of them. This is not perfect, since it is a plain substring match, but it should give us a good enough dataset to work with.
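# This text file contains the names of all first generation pokemons
with open('data/first_gen_pokemons.txt', 'r') as f:
    # split() breaks on any whitespace, so each whitespace-separated token becomes one pattern
    first_gen_pokemons = f.read().split()

# Join the names into a single regex alternation: 'name1|name2|...'
first_gen_pokemons = '|'.join(first_gen_pokemons)

df = pd.DataFrame(all_cards_dataset)
filtered_df = df[df['name'].str.contains(first_gen_pokemons)]
filtered_df = filtered_df[filtered_df['supertype'] == 'Pokémon']  # Keep only Pokémon cards
print(f"Shape of our dataframe: {filtered_df.shape}")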
Shape of our dataframe: (4470, 25)
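Because the filter above is a substring match, a short name can also match inside a longer, unrelated one. If that turns out to matter, one possible tightening is to escape each name and require that it is not glued to other letters or digits. This is only a sketch on made-up sample data: `build_name_pattern` is a hypothetical helper, and I have not checked how it would change the 4470 figure.

```python
import re

import pandas as pd


def build_name_pattern(names):
    """Build a regex matching any of the names when not surrounded by word characters."""
    escaped = [re.escape(name) for name in names]  # treat '.' etc. literally
    # (?<!\w) and (?!\w) reject matches that are part of a longer word or number
    return r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)'


# Made-up sample data for illustration
sample_names = ['Porygon', 'Mew', 'Mewtwo', 'Mr. Mime']
sample_cards = pd.DataFrame({
    'name': ['Cool Porygon', 'Porygon2', 'Mewtwo', 'Mew ex', 'Mr. Mime', 'Professor Oak'],
})

pattern = build_name_pattern(sample_names)
mask = sample_cards['name'].str.contains(pattern, regex=True)
print(sample_cards[mask]['name'].tolist())
# → ['Cool Porygon', 'Mewtwo', 'Mew ex', 'Mr. Mime']
```

Here “Porygon2” is dropped because the name is followed directly by a digit. Names that differ only by punctuation, such as a hyphenated suffix, would still match, so this narrows the filter rather than making it exact.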
We have taken our initial dataset of 19653 cards and filtered it down to 4470 cards. Now that the dataset is filtered, we can move on to cleaning and processing the data. Let's save this filtered dataset to a CSV file for part 2.
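A minimal sketch of the save step, with one thing to watch for when reloading. The output file name is hypothetical, and a tiny stand-in DataFrame and a temporary directory are used so the snippet runs on its own. It assumes the data has a list-valued column such as `types`; with the real data you would call `to_csv` on `filtered_df` instead.

```python
import ast
import os
import tempfile

import pandas as pd

# Tiny stand-in for filtered_df (assumed columns, for illustration only)
sample_df = pd.DataFrame({
    'name': ['Abra', 'Cool Porygon'],
    'supertype': ['Pokémon', 'Pokémon'],
    'types': [['Psychic'], ['Colorless']],  # list-valued column
})

# Hypothetical file name; written to a temporary directory instead of data/
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'first_gen_pokemon_cards.csv')
sample_df.to_csv(csv_path, index=False, encoding='utf-8')

# List-valued columns are written as text, so they come back as strings
reloaded = pd.read_csv(csv_path, encoding='utf-8')
print(type(reloaded.loc[0, 'types']))

# ast.literal_eval turns the text back into a Python list
reloaded['types'] = reloaded['types'].apply(ast.literal_eval)
print(reloaded.loc[0, 'types'])
```

Any list or dictionary columns will need this kind of parsing when the CSV is loaded again in part 2.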